Abstract

In this work we describe a Convolutional Neural Network (CNN) to
accurately predict the scene illumination. Taking image patches as
input, the CNN works in the spatial domain without using the
hand-crafted features employed by most previous methods. The network
consists of one convolutional layer with max pooling, one fully
connected layer and three output nodes. Within the network structure,
feature learning and regression are integrated into one optimization
process, which leads to a more effective model for estimating scene
illumination. This approach achieves state-of-the-art performance on a
standard dataset of RAW images. Preliminary experiments on images with
spatially varying illumination demonstrate the stability of the local
illuminant estimation ability of our CNN.

1. Introduction

Many computer vision problems in both still images and videos can make
use of color constancy as a preprocessing step to ensure that the
recorded color of the objects in the scene does not change under
different illumination conditions. The observed color of the objects in
the scene depends on the intrinsic color of the object (i.e. the surface
spectral reflectance), on the illumination, and on their relative
positions.

In general there are two methodologies to obtain a reliable color
description from image data: computational color constancy and color
invariance [28]. Computational color constancy is a two-stage
operation: the first stage estimates the color of the scene illuminant
from the image data, and the second corrects the image on the basis of
this estimate to generate a new image of the scene as if it were taken
under a reference light source. Color invariance methods instead
represent images by features which remain unchanged with respect to
specific imaging conditions.

In this work we focus on computational color constancy, using a CNN to
learn discriminant features for the illuminant estimation task.
Recently, deep neural networks have gained the attention of numerous
researchers, outperforming state-of-the-art approaches on various
computer vision tasks [25, 26]. One advantage of CNNs is that they can
take raw images as input and incorporate feature learning into the
training process. With a deep structure, a CNN can learn complicated
mappings while requiring minimal domain knowledge.

To the best of our knowledge, this is the first work that investigates
the use of CNNs for illuminant estimation. The main contribution of our
paper is a novel method that allows learning and prediction of the
scene illuminant on local regions. Previous approaches typically
accumulate features over the entire image to obtain statistics for
estimating the overall illuminant, and only a few approaches have shown
the ability to estimate spatially varying illuminants. By contrast, our
method can estimate the illuminant on small patches (such as 32x32). We
show experimentally that the proposed method advances the
state-of-the-art on a standard dataset of RAW images. In addition to
the superior overall performance, we also show quantitative and
qualitative results that demonstrate the quality of the local
illuminant estimation of our method.

2. Problem formulation and related works 

The image values for a Lambertian surface located at the pixel with
coordinates $(x, y)$ can be seen as a function $\boldsymbol{\rho}(x, y)$,
mainly dependent on three physical factors: the illuminant spectral
power distribution $I(x, y, \lambda)$, the surface spectral reflectance
$S(x, y, \lambda)$ and the sensor spectral sensitivities
$\mathbf{C}(\lambda)$. Using this notation $\boldsymbol{\rho}(x, y)$ can
be expressed as

$$\boldsymbol{\rho}(x, y) = \int I(x, y, \lambda)\, S(x, y, \lambda)\, \mathbf{C}(\lambda)\, d\lambda, \qquad (1)$$

where $\lambda$ is the wavelength, $\boldsymbol{\rho}$ and
$\mathbf{C}(\lambda)$ are three-component vectors and the integration is
performed over the visible spectrum. The goal of color constancy is to
estimate the color $\mathbf{I}(x, y)$ of the scene illuminant, i.e. the
projection of $I(x, y, \lambda)$ on the sensor spectral sensitivities
$\mathbf{C}(\lambda)$:

$$\mathbf{I}(x, y) = \int I(x, y, \lambda)\, \mathbf{C}(\lambda)\, d\lambda. \qquad (2)$$
Usually the illuminant color is estimated up to a scale factor, as it is
more important to estimate the chromaticity of the scene illuminant
than its overall intensity [21]. Since the only information available is
the set of sensor responses $\boldsymbol{\rho}$ across the image, color
constancy is an under-determined problem [14] and thus further
assumptions and/or knowledge are needed to solve it.

Several computational color constancy algorithms have been proposed,
each based on different assumptions. The most common assumption is that
of a uniform light source color across the scene, i.e.
$\mathbf{I}(x, y) = \mathbf{I}$.

State-of-the-art solutions can be divided into two main classes:
statistic-based approaches and learning-based approaches.
Statistic-based approaches estimate the scene illumination only on the
basis of the content of a single image, making assumptions about the
nature of color images and exploiting statistical or physical
properties; learning-based approaches require training data in order to
build a statistical image model prior to the estimation of the
illumination.

2.1. Statistic-based algorithms 

Van de Weijer et al. [33] have unified a variety of algorithms. These
algorithms estimate the illuminant color $\mathbf{I}$ by implementing
instantiations of the following equation:

$$\mathbf{I}(n, p, \sigma) = \frac{1}{k} \left( \iint \left| \nabla^{n} \boldsymbol{\rho}^{\sigma}(x, y) \right|^{p} dx\, dy \right)^{1/p}, \qquad (3)$$

where $n$ is the order of the derivative, $p$ is the Minkowski norm,
$\boldsymbol{\rho}^{\sigma}(x, y) = \boldsymbol{\rho}(x, y) \otimes G_{\sigma}(x, y)$
is the convolution of the image with a Gaussian filter
$G_{\sigma}(x, y)$ with scale parameter $\sigma$, and $k$ is a constant
chosen such that the illuminant color $\mathbf{I}$ has unit length
(using the 2-norm). The integration is performed over all pixel
coordinates. Different $(n, p, \sigma)$ combinations correspond to
different illuminant estimation algorithms, each based on a different
assumption. For example, the Gray World algorithm [6], generated by
setting $(n, p, \sigma) = (0, 1, 0)$, is based on the assumption that
the average color in the image is gray and that the illuminant color
can be estimated as the shift from gray of the averages in the image
color channels; the White Point algorithm [ ], generated by setting
$(n, p, \sigma) = (0, \infty, 0)$, is based on the assumption that
there is always a white patch in the scene and that the maximum values
in each color channel are caused by the reflection of the illuminant on
the white patch, so that they can be used as the illuminant estimate;
the Gray Edge algorithm [33], generated by setting for example
$(n, p, \sigma) = (1, 0, 0)$, is based on the assumption that the
average color of the edges is gray and that the illuminant color can be
estimated as the shift from gray of the averages of the edges in the
image color channels.
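The zeroth-order members of this family are easy to sketch in code. The following is a minimal illustration, not the authors' implementation: it covers only the case $n = 0$, $\sigma = 0$ (no derivative, no Gaussian smoothing), and the function name `minkowski_illuminant` is an assumption introduced here.

```python
import numpy as np

def minkowski_illuminant(image, p=1.0):
    """Estimate the illuminant color of an H x W x 3 image for n = 0, sigma = 0.

    p = 1 gives Gray World, p = np.inf gives White Point,
    any other finite p > 0 gives Shades of Gray.
    """
    pixels = np.asarray(image, dtype=np.float64).reshape(-1, 3)
    if np.isinf(p):
        # Limit of the Minkowski norm: the per-channel maximum.
        estimate = np.abs(pixels).max(axis=0)
    else:
        # Per-channel Minkowski mean over all pixels.
        estimate = np.mean(np.abs(pixels) ** p, axis=0) ** (1.0 / p)
    # Dividing by the 2-norm plays the role of the constant k in Eq. 3.
    return estimate / np.linalg.norm(estimate)
```

For a scene made only of gray surfaces, every choice of $p$ recovers the direction of the light source exactly; on real images the estimates differ, which is why the choice of $(n, p, \sigma)$ matters.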

The Gamut Mapping algorithm assumes that for a given illuminant, one
observes only a limited gamut of colors [13]. It has a preliminary
phase in which a canonical illuminant is chosen and the canonical gamut
is computed by observing as many surfaces under the canonical
illuminant as possible. Given an input image with an unknown
illuminant, its gamut is computed and the illuminant is estimated as
the mapping that can be applied to the gamut of the input image,
resulting in a gamut that lies completely within the canonical gamut
and produces the most colorful scene. If the spectral sensitivity
functions of the camera are known, the Color by Correlation approach
can also be used [12].

2.2. Learning-based algorithms 

The learning-based color constancy algorithms, which estimate the scene
illuminant using a model learned on training data, can be subdivided
into two main subcategories: probabilistic methods and
fusion/selection-based methods. Bayesian approaches [16] model the
variability of reflectance and of illuminant as random variables, and
then estimate the illuminant from the posterior distribution
conditioned on image intensity data.

Given a set of computational color constancy algorithms, in [1] an
image classifier is trained to classify the images as indoor or
outdoor, and different experimental frameworks are proposed to exploit
this information in order to select the best performing algorithm on
each class. In [2] it has been shown how intrinsic, low-level properties
of the images can be used to drive the selection of the best algorithm
(or the best combination of algorithms) for a given image. The
algorithm selection and combination is made by a decision forest
composed of several trees on the basis of the values of a set of
heterogeneous features.

In [17] the Weibull parametrization has been used to train a maximum
likelihood classifier based on a mixture of Gaussians to select the
best performing color constancy method for a certain image.

In [8] a statistical model for the spatial distribution of colors in
white balanced images is developed, and then used to infer illumination
parameters as those being most likely under the model. High-level
visual information has been used to select the best illuminant out of a
set of possible illuminants [34]. This is achieved by restating the
problem in terms of semantic interpretability of the image. Several
color constancy methods are applied to generate a set of illuminant
hypotheses. For each illuminant hypothesis, the image is corrected, the
likelihood of the semantic content of the corrected image is evaluated,
and the most likely illuminant color is selected. In [3, 4] the use of
automatically detected objects having intrinsic color is investigated.
In particular, it is investigated how illuminant estimation can be
performed by exploiting the color statistics extracted from the faces
automatically detected in the image. When no faces are detected in the
image, any other algorithm in the state-of-the-art can be used. In
[23, 24] the surfaces in the image are exploited and the color
constancy problem is addressed by unsupervised learning of an
appropriate model for each training surface in the training images. The
model for each surface is defined using both texture features and color
features. In a test image the nearest neighbor model is found for each
surface and its illumination is estimated by comparing the statistics
of pixels belonging to nearest neighbor surfaces and the target
surface. The final illumination estimate results from combining these
estimated illuminants over surfaces to generate a unique estimate.

3. The proposed approach 

The proposed framework for using a CNN for illuminant estimation is as
follows. Given a color image, we sample non-overlapping patches from it
and for each of them we perform a contrast normalization through
histogram stretching. We use a CNN to estimate the illuminant for each
patch and combine the patch scores to obtain an illuminant estimate for
the image.
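The patch sampling step can be sketched as follows. This is an illustrative sketch only: the function name `extract_patches` is hypothetical, and dropping border pixels that do not fill a whole patch is an assumption, since the text does not say how image borders are handled.

```python
import numpy as np

def extract_patches(image, size=32):
    """Split an H x W x C image into non-overlapping size x size patches.

    Border pixels that do not fill a whole patch are dropped (an assumption).
    Returns an array of shape (num_patches, size, size, C), in row-major order.
    """
    image = np.asarray(image)
    rows, cols = image.shape[0] // size, image.shape[1] // size
    cropped = image[:rows * size, :cols * size]
    channels = cropped.shape[2]
    # Split both spatial axes into (block index, offset inside block).
    grid = cropped.reshape(rows, size, cols, size, channels)
    return grid.transpose(0, 2, 1, 3, 4).reshape(rows * cols, size, size, channels)
```

Each patch would then be contrast normalized and passed to the network independently of the others.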

3.1. Network architecture 

The proposed network consists of five layers. Figure 1 shows the
architecture of our network, which is a 32x32x3 - 32x32x240 - 4x4x240 -
40 - 3 structure. The input consists of contrast-normalized 32x32 image
patches. The first layer is a convolutional layer which filters the
input with 240 kernels, each of size 1x1x3, with a stride of 1 pixel.
The convolutional layer produces 240 feature maps, each of size 32x32,
followed by a max-pooling operation with 8x8 kernels and a stride of 8
pixels that reduces each feature map to a 4x4 feature map. These are
reshaped into a 3840-dimensional (4x4x240) vector. One fully connected
layer of 40 nodes comes after the reshaping. The last layer is a simple
linear regression with a three-dimensional output that gives the
illuminant estimate.
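The shapes in this structure can be traced with a short forward pass. The sketch below is only meant to make the dimensions concrete: the weights are random placeholders rather than learned values, the bias terms and the absence of a nonlinearity after the convolution are assumptions, and the function name `forward` is hypothetical.

```python
import numpy as np

def forward(patch, w_conv, b_conv, w_fc, b_fc, w_out, b_out):
    """Forward pass of the 32x32x3 - 32x32x240 - 4x4x240 - 40 - 3 structure."""
    # A 1x1x3 convolution with stride 1 is a per-pixel linear map over channels.
    maps = patch @ w_conv + b_conv                          # (32, 32, 240)
    # 8x8 max pooling with stride 8: split each spatial axis into 4 blocks of 8.
    pooled = maps.reshape(4, 8, 4, 8, -1).max(axis=(1, 3))  # (4, 4, 240)
    flat = pooled.reshape(-1)                               # (3840,)
    # Fully connected layer with ReLU units.
    hidden = np.maximum(flat @ w_fc + b_fc, 0.0)            # (40,)
    # Linear regression output: one value per color channel.
    return hidden @ w_out + b_out                           # (3,)

# Random placeholder parameters, used only to check the shapes.
rng = np.random.default_rng(0)
params = (
    rng.normal(scale=0.1, size=(3, 240)), np.zeros(240),
    rng.normal(scale=0.01, size=(3840, 40)), np.zeros(40),
    rng.normal(scale=0.1, size=(40, 3)), np.zeros(3),
)
estimate = forward(rng.uniform(size=(32, 32, 3)), *params)
```

Under these assumptions the network has 154,723 parameters, almost all of them in the fully connected layer (3840 x 40 weights).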

3.2. Contrast normalization 

In order to be robust across lighting conditions, and since in color
constancy the illuminant has to be estimated up to a scale factor, all
the extracted patches are contrast normalized. Among the different
contrast enhancement techniques we have chosen global histogram
stretching, as it does not change the relative contributions of the
three color channels.
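One common form of global histogram stretching is sketched below. The exact formula used in the paper is not given in the text, so this is an assumption: a single minimum and a single maximum, computed over all three channels together, are mapped to 0 and 1. The function name `stretch_contrast` and the `eps` guard are hypothetical.

```python
import numpy as np

def stretch_contrast(patch, eps=1e-12):
    """Global histogram stretching: one min and one max shared by all channels."""
    patch = np.asarray(patch, dtype=np.float64)
    low, high = patch.min(), patch.max()
    # eps avoids a division by zero on perfectly uniform patches.
    return (patch - low) / max(high - low, eps)
```

Because the same offset and scale are applied to every channel, the three channels stay on a common scale, unlike per-channel stretching, which would itself alter the color balance.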

3.3. Pooling 

In the convolutional layer, the contrast-normalized image patches are
convolved with 240 filters and each filter generates a feature map. We
then apply max-pooling on each feature map to reduce the filter
responses to a lower dimension. In contrast with the object recognition
scenario, where pooling tends to be performed on small neighborhoods,
we observe that even in the case of spatially varying illuminants,
these are locally homogeneous, i.e. the same illuminant tends to be
present at all the locations of a 32x32 patch. This permits the use of
larger pooling kernels.

3.4. ReLU nonlinearity 

Instead of traditional sigmoid or tanh neurons, we use Rectified Linear
Units (ReLUs) [29] in the fully connected layer. Krizhevsky et al.
demonstrated that ReLUs enable the network to train several times
faster compared to using tanh units while achieving almost identical
performance [26].

3.5. CNN features 

Together with learning an ad-hoc CNN for the color constancy problem,
we also investigate how a pre-trained one works on this problem. To
this end, we extract a 4096-dimensional feature vector from each image
using the Caffe [22] implementation of the deep CNN described by
Krizhevsky et al. [26]. Features are computed by forward propagation of
a mean-subtracted 227x227 RGB RAW image through five convolutional
layers and two fully connected layers. More details about the network
architecture can be found in [26, 22]. The CNN was discriminatively
trained on a large dataset (ILSVRC 2012) with image-level annotations
to classify images into 1000 different classes. Features are obtained
by extracting the activation values of the last hidden layer. The
extracted features are then used as input to a linear Support Vector
Regressor (SVR) [10] to estimate the illuminant color for each image.
We refer to this method as AlexNet+SVR in the experimental results.

4. Experimental Setup 

The aim of this section is to investigate whether the proposed
algorithm can outperform state-of-the-art algorithms in illuminant
estimation on a standard dataset of RAW images.

4.1. Image Datasets and Evaluation Procedure 

To test the performance of the proposed algorithm, a standard dataset
of RAW camera images with a known color target is used. It was captured
using high-quality digital SLR cameras in RAW format, and is therefore
free of any color correction. The dataset [16] was originally available
in sRGB format, but Shi and Funt [31] reprocessed the raw data to
obtain linear images with a higher dynamic range (14 bits as opposed to
the standard 8 bits). The dataset was acquired using a Canon 5D and a
Canon 1D DSLR camera and consists of a total of 568 images. The Macbeth
ColorChecker (MCC) chart is included in every scene acquired, which
allows the actual illuminant of each acquired image to be estimated
accurately. Examples of images within the RAW dataset are reported in
Figure 2.
Figure 1. The architecture of our CNN.



Figure 2. Example of images within the RAW dataset. 


4.2. Error metric 

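The error metric considered, as suggested by Hordley and Finlayson
[21], is the angle between the RGB triplet of the estimated illuminant
($\boldsymbol{\rho}_w$) and the RGB triplet of the measured ground
truth illuminant ($\hat{\boldsymbol{\rho}}_w$):

$$e_{ANG} = \arccos \left( \frac{\boldsymbol{\rho}_w^{T} \hat{\boldsymbol{\rho}}_w}{\|\boldsymbol{\rho}_w\| \, \|\hat{\boldsymbol{\rho}}_w\|} \right). \qquad (4)$$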
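The angular error can be computed in a few lines. The sketch below is illustrative: the function name `angular_error` is hypothetical, and returning the angle in degrees is an assumption.

```python
import numpy as np

def angular_error(estimate, ground_truth):
    """Angle in degrees between two RGB illuminant triplets."""
    a = np.asarray(estimate, dtype=np.float64)
    b = np.asarray(ground_truth, dtype=np.float64)
    cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))
```

Because both vectors are normalized, the metric ignores the overall intensity of the illuminant and measures only the difference in chromaticity, in line with estimating the illuminant up to a scale factor.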

4.3. Benchmark algorithms 

Different benchmark algorithms for color constancy are considered.
Since each image of the dataset contains only one MCC, only global
color constancy algorithms based on the assumption of uniform
illumination can be compared. Six of them are generated by varying the
three variables $(n, p, \sigma)$ in Equation 3, and correspond to well
known and widely used color constancy algorithms. The values chosen for
$(n, p, \sigma)$ are reported in Table 1 and set as in [19]. The
algorithms are used in the original authors' implementation, which is
freely available online
(http://lear.inrialpes.fr/people/vandeweijer/code/ColorConstancy.zip).
The seventh algorithm is the pixel-based Gamut Mapping [18]. The value
chosen for $\sigma$ is also reported in Table 1. The other algorithms
considered are illumination chromaticity estimation via Support Vector
Regression (SVR [15]); the Bayesian (BAY [16]); the Natural Image
Statistics (NIS [17]); the High Level Visual Information [34]:
bottom-up (HLVI BU), top-down (HLVI TD), and their combination (HLVI
BU&TD); the Spatio-Spectral statistics [8]: with Maximum Likelihood
estimation (SS ML), and with General Priors (SS GP); the Automatic
color constancy Algorithm Selection (AAS) [2] and the Automatic
Algorithm Combination (AAC) [2]; the Exemplar-Based color constancy
(EB) [23]; the Face-Based (FB) color constancy algorithm [3] using GM
or SS GP when no faces are detected.

Table 1. Values chosen for (n, p, σ) for the state-of-the-art
algorithms which are instantiations of Eq. 3.

Algorithm                    n    p    σ
Gray World (GW)              0    1    0
White Point (WP)             0    ∞    0
Shades of Gray (SoG)         0    4    0
general Gray World (gGW)     0    9    9
1st-order Gray Edge (GE1)    1    1    6
2nd-order Gray Edge (GE2)    2    1    1
Gamut Mapping (GM)           0    -    0

The last algorithm considered is the Do Nothing (DN) algorithm, which
gives the same estimate for the color of the illuminant
($\mathbf{I} = [1\ 1\ 1]$) for every image, i.e. it assumes that the
image is already correctly balanced.

4.4. CNN learning 

We train our CNN on 32x32 random patches taken from images in RAW
format. Images have been resized to max(w, h) = 1200. The net is
learned using a three-fold cross validation on the folds provided with
the dataset: for each run one fold is used for training, one for
validation and the remaining one for testing. For training, we assign
to each patch the illuminant groundtruth associated with the image to
which it belongs. At testing time, we generate a single illuminant
estimate per image by pooling the predicted patch illuminants. By
taking image patches as input, we have a much larger number of training
samples compared to using the whole image on a given dataset, which
particularly meets the needs of CNNs. Net parameters have been learned
using Caffe [22] with a Euclidean loss. The learned net is then
fine-tuned by using the angular error as loss and adding knowledge
about the way local estimates are pooled to generate a single global
estimate for each image.
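The pooling of patch estimates into one global estimate can be sketched as below, covering the average and median strategies. The function name `pool_estimates` is hypothetical, and normalizing each patch estimate to unit length before pooling is an assumption made here so that only chromaticity is pooled.

```python
import numpy as np

def pool_estimates(patch_estimates, mode="median"):
    """Combine N x 3 patch illuminant estimates into one unit-length RGB triplet."""
    est = np.asarray(patch_estimates, dtype=np.float64)
    # Normalize each patch estimate first so that only chromaticity is pooled (assumption).
    est = est / np.linalg.norm(est, axis=1, keepdims=True)
    # Channel-wise median or mean over all patches of the image.
    pooled = np.median(est, axis=0) if mode == "median" else est.mean(axis=0)
    return pooled / np.linalg.norm(pooled)
```

The channel-wise median is less sensitive than the average to a few patches with badly wrong estimates, for example patches dominated by a single saturated color.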

5. Results and Discussion 

In Table 2 the minimum, the 10th percentile, the median, the average,
the 90th percentile, and the maximum of the angular errors obtained by
the considered state-of-the-art algorithms and the proposed approach on
the RAW dataset are reported. The table is divided into three blocks
and for each of them the best result for each statistic is reported in
bold. The first block includes statistic-based algorithms, the second
one learning-based algorithms, and the third one the different variants
of the proposed approach.

From the results it is possible to see that the deep CNN pre-trained on
ILSVRC 2012 [26] coupled with SVR (i.e. AlexNet+SVR) is already able to
outperform most statistic-based algorithms and some learning-based
ones. The next entry is the angular error made by our CNN on the
patches. With respect to the median error, the proposed approach is
able to outperform half of the learning-based algorithms considered.
The next two entries are the results obtained by our approach by
pooling patch-based illuminant estimates over the whole image. Two very
simple pooling strategies have been considered, i.e. average and median
pooling. Both outperform most of the state-of-the-art algorithms in
terms of median error, with average pooling having a maximum error just
0.3% worse than the best algorithm in the state-of-the-art. The last
entry reported is the fine-tuned CNN with median pooling and angular
error loss. It reaches a median, an average and a maximum angular error
better than all the state-of-the-art algorithms considered. The
improvements are 1.5%, 5.1% and 0.2% respectively; what is remarkable
is that they are achieved by a single algorithm, while the best values
in the state-of-the-art for the same statistics were obtained by three
different learning-based algorithms, i.e. FB+GM, EB and SS GP
respectively.

Figure 3 reports some examples of images on which the fine-tuned CNN
makes the largest estimation errors. From left to right we report the
original RAW image, the image corrected with the groundtruth
illuminant, the image corrected with the CNN estimate, and the image
corrected with the algorithm in the state-of-the-art making the best
estimate on that image. Once we have an estimate of the global
illuminant color $\mathbf{I}$, each pixel in the image is color
corrected using the von Kries model [35], i.e.:
$\boldsymbol{\rho}_{out}(x, y) = \mathrm{diag}(\mathbf{I}^{-1})\, \boldsymbol{\rho}_{in}(x, y)$.
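The von Kries correction is a per-channel division by the illuminant estimate. The sketch below is a minimal illustration with a hypothetical function name; it leaves the overall intensity scaling unspecified, since the illuminant is only estimated up to a scale factor.

```python
import numpy as np

def von_kries_correct(image, illuminant):
    """Apply diag(1 / I) to every pixel of an H x W x 3 image."""
    image = np.asarray(image, dtype=np.float64)
    illuminant = np.asarray(illuminant, dtype=np.float64)
    # Broadcasting divides channel c of every pixel by illuminant[c].
    return image / illuminant
```

With a perfect estimate, a gray surface ends up with equal values in the three channels after correction.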

5.1. Effects of parameters 

Several parameters are involved in the CNN design. In this section, we
examine how these parameters affect the network performance. The
network architecture has been chosen by starting from a 7-layer deep
CNN similar to [26] and removing layers until no further improvement in
performance was possible. In Figure 4 we report how the different
parameters affect the illuminant estimation performance. Each point on
the graphs represents the best result that can be obtained by fixing a
parameter at the value indicated and searching over all the possible
combinations of the other ones.

Kernel size. Figure 4.a shows how the performance varies with the width
of the convolution kernels. The estimation error decreases as the
kernel size decreases. At first this could be surprising, since in
other domains larger kernels are preferred. However, it is not the
first time that such small kernels are used, see [32]. From the color
constancy point of view, this choice of kernel size confirms the
finding of Cheng et al. [9], who showed that spatial information does
not provide any additional information that cannot be obtained directly
from the color distributions.

Number of kernels. Figure 4.b shows how the performance varies with
respect to the number of convolution kernels. It is possible to see
that the CNN tends to prefer an intermediate number of kernels.

Pooling size. Figure 4.c shows how the performance varies with respect
to the pooling size. It is possible to see that the CNN tends to prefer
an intermediate pooling size.

Number of fully connected units. Figure 4.d shows how the performance
varies with respect to the number of fully connected units. The plot
shows that better performance can be reached with a number of fully
connected units around 40.

Patch size. Since in our experiment the illuminant is estimated as the
median illuminant of all patches sampled, we examine how the patch size
affects performance. For every patch size, the same number of patches
is randomly extracted from each image, making sure that none of them
contains the reference color chart. Figure 4.e shows the change of
performance with respect to patch size. From the plot we see that a
larger patch size results in better performance.


Table 2. Angular error statistics obtained by the state-of-the-art algorithms considered on the RAW dataset.

Algorithm              Min    10th prc   Med     Avg     90th prc   Max
DN                     3.72   10.38      13.55   13.62   16.45      27.37
GW                     0.18    1.88       6.30    6.27   10.12      24.84
WP                     0.08    1.38       5.61    7.46   15.68      40.59
SoG                    0.18    1.04       4.04    4.85    9.71      19.93
gGW                    0.03    0.82       3.45    4.60    9.68      22.21
GE1                    0.16    1.82       4.55    5.21    9.78      19.69
GE2                    0.26    2.06       4.43    5.01    8.93      16.87
GM                     0.05    0.40       2.28    4.10   11.08      23.18

SVR                    0.66    3.36       6.67    7.99   14.61      26.08
BAY                    0.10    1.17       3.44    4.70   10.21      24.47
NIS                    0.08    0.93       3.13    4.09    8.57      26.20
HLVI BU                0.06    0.75       2.54    3.30    6.59      17.51
HLVI TD                0.11    0.85       2.63    3.65    7.53      25.24
HLVI BU&TD             0.13    0.77       2.47    3.38    6.97      25.24
SS ML                  0.06    0.85       2.93    3.55    7.23      15.25
SS GP                  0.07    0.82       2.90    3.47    7.00      14.80
AAS                    0.03    0.77       3.16    4.18    9.15      22.21
AAC                    0.05    0.90       2.90    3.74    7.93      14.98
EB                     0.14    0.73       2.24    2.77    5.52      19.44
FB+GM                  0.05    0.40       2.01    3.67    9.50      23.18
FB+SS GP               0.08    0.75       2.57    3.18    6.67      14.80

AlexNet+SVR            0.12    0.98       3.09    4.74   11.18      29.15
CNN per patch          0.00    0.99       2.69    3.67    7.79      30.93
CNN average-pooling    0.04    0.99       2.44    3.18    6.37      14.84
CNN median-pooling     0.06    0.97       2.32    3.07    6.15      19.04
CNN fine-tuned         0.06    0.69       1.98    2.63    5.54      14.77


5.2. Local illuminant estimation 

Our CNN predicts the illumination on small image patches, so it can
easily be used to predict local illuminants as well as to give a global
illuminant estimate for the entire image. Given the per patch error
reported in Table 2, we expect our CNN to perform well even on local
estimation. We perform here a preliminary test by using our learned CNN
as-is on a dataset of synthetic images: the images are taken from the
previous RAW dataset [31] and on half of each image we manually changed
the illuminant, resulting in two illuminants for each image.
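One way to build such a two-illuminant image is sketched below. The text does not describe how the illuminant was changed or which half was modified, so everything here is an assumption for illustration: the right half is re-rendered with a diagonal (von Kries style) transform, and the function name `relight_half` is hypothetical.

```python
import numpy as np

def relight_half(image, original_illuminant, new_illuminant):
    """Return a copy whose right half is re-rendered under new_illuminant.

    Also returns an H x W x 3 map with the ground-truth illuminant of each pixel.
    """
    image = np.asarray(image, dtype=np.float64)
    old = np.asarray(original_illuminant, dtype=np.float64)
    new = np.asarray(new_illuminant, dtype=np.float64)
    width = image.shape[1]
    out = image.copy()
    truth = np.broadcast_to(old, image.shape).copy()
    # Diagonal change of illuminant applied to the right half only.
    out[:, width // 2:] *= new / old
    truth[:, width // 2:] = new
    return out, truth
```

The per-pixel ground-truth map makes it possible to compute an angular error for every patch rather than one error per image.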

Among the algorithms in the state-of-the-art able to deal with
non-uniform illumination, e.g. [27, 30, 11, 5, 24, 4], we report as a
comparison the results of the Multiple Light Sources (MLS) method [20]
using the White Point (WP) and Gray World (GW) algorithms, with
grid-based sampling, in the clustering version with the number of
clusters set equal to the number of lights in the scene, i.e. two.

The numerical results are reported in Table 3, while a couple of
examples are given in Figure 5. For the proposed approach we report
three different entries: the first one is the error on each patch; the
second and third ones are the patch-by-patch errors obtained by taking
into account the spatial arrangement of the patches to perform a
spatial filtering of the local estimates: the former employs a 3x3
Gaussian filter, the latter a 3x3 median filter.
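The spatial filtering of local estimates can be sketched for the median case as follows. This is an illustrative sketch, assuming the patch estimates are arranged on an R x C grid that mirrors the patch layout, that each color channel is filtered independently, and that borders are handled by replicating edge values; none of these details is stated in the text, and the function name is hypothetical.

```python
import numpy as np

def median_filter_3x3(estimate_grid):
    """Apply a 3x3 median filter, channel by channel, to an R x C x 3 grid."""
    grid = np.asarray(estimate_grid, dtype=np.float64)
    rows, cols = grid.shape[:2]
    # Replicate border values so that every cell has a full 3x3 neighborhood.
    padded = np.pad(grid, ((1, 1), (1, 1), (0, 0)), mode="edge")
    # Nine shifted views of the grid, one per position in the 3x3 window.
    windows = [padded[i:i + rows, j:j + cols] for i in range(3) for j in range(3)]
    return np.median(np.stack(windows, axis=0), axis=0)
```

A single outlying patch estimate surrounded by consistent neighbors is replaced by the neighborhood value, which is the behavior one wants when the illuminant is locally homogeneous.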

6. Conclusions 

In this work we have developed a CNN for color constancy. Our algorithm
combines feature learning and regression into a single optimization
process, which enables us to employ modern training techniques to boost
performance. The experimental results showed that our algorithm
achieves state-of-the-art performance on a standard dataset of RAW
images, outperforming 21 algorithms in the state-of-the-art belonging
to both the statistic-based and the learning-based classes.
Furthermore, a preliminary test shows that our algorithm can be adapted
to estimate local illuminants.

As future work we plan to investigate other pooling strategies to
combine patch-based illuminant estimates into a global one. We also
plan to investigate whether additional information can be fed to the
CNN to further improve the performance. We will also conduct a more
thorough study of the extension of the proposed approach to local
illuminant estimation, conducting the experiments on larger datasets
and comparing with more algorithms in the state-of-the-art.

Figure 3. Examples of images on which the fine-tuned CNN makes the largest estimation errors. Left to right: original RAW image,
correction with the groundtruth illuminant, correction with the CNN estimate, and correction with the algorithm in the state-of-the-art
making the best estimate on the given image.


Table 3. Angular error statistics obtained on the synthetic RAW dataset of images with spatially varying illumination.

Algorithm                Min    10th prc   Med     Avg     90th prc   Max
DN                       5.90   9.99       13.38   13.62   17.09      27.71
MLS+GW                   0.12   3.35        8.03    8.72   14.66      32.98
MLS+WP                   0.23   2.59        6.09    7.03   13.15      33.58
CNN per patch            0.00   1.01        2.83    3.72    7.78      31.78
CNN gaussian filtering   0.01   0.99        2.71    3.50

Figure 4. Effects of the parameters on the CNN performance. Angular error with respect to varying convolution kernel width (a), number
of convolutional kernels (b), pooling size (c), number of fully connected units (d) and input patch size (e). Each point corresponds to the
best performance that can be obtained by fixing a single parameter at the value indicated and trying all the combinations for the other ones.



Figure 5. Examples of local illumination estimation. Left to right: original image, groundtruth illumination, local estimation, local angular
error map.